This project was created for Doctor Ken Steif’s MUSA 507 course in Fall 2018.
1. Introduction
2. Data
3. Method of Variable Selection and The Finalised Model
4. The Final Model: Regression Results and Discussion
5. Cross-validation Checks and Discussion
6. Extra Credit: Spatial Cross Validation
7. Concluding Remarks and Reflections on Class Discussion
Nashville is among the fastest growing real estate markets in the country, with rising home prices and an increasing influx of people from out of state. Numerous factors make the city attractive to would-be residents, including job opportunities, quality of life, amenities and redevelopment. As more people look to buy homes in Nashville, prospective homeowners need accurate predictions of home values to make sure they end up with positive equity on their investments. For planners, it is important to study the changing patterns in the housing market and accurately identify where these changes occur in order to better plan for inclusive, equitable and thriving cities. By anticipating where price changes are likely to occur, planners can prepare stronger policies that support the growing market while also considering people’s ability to afford living in that area. Stricter policies may restrain homeowners from raising their sale prices too far, keeping property ownership in the region from being limited to the wealthy; where sale prices are low, planners can look for development opportunities that make the region more attractive for new homeowners to invest in. With Nashville’s population projected to increase in the upcoming years¹, it is important for planners to be able to accommodate the growing number of people with affordable home prices.
The purpose of this project is to build a predictive model of home prices in Nashville using machine learning algorithms. Data from open sources such as Zillow, Nashville Open Data, and the US Census Bureau were aggregated and analyzed using R.
An overall 5 Step Process was used to create a final prediction model for home sale prices. The general 5 Step Process included:
Underlying this process is a conscious consideration of the Modifiable Areal Unit Problem (MAUP). Therefore, where possible, our data processing preferred point pattern analysis over polygon-based analysis. In other words, spatial factors were based on the distance from each observation point rather than being aggregated by zone.
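As a rough illustration of a point-based spatial factor, the sketch below computes the distance from one property to its nearest feature (for example, a school). The project itself was carried out in R; this Python version, the function names, and the coordinates are all hypothetical and are not taken from the project's code or data.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def distance_to_nearest(point, features):
    """Distance in miles from one (lat, lon) point to the closest feature."""
    return min(haversine_miles(point[0], point[1], f[0], f[1]) for f in features)


# Hypothetical coordinates, for illustration only.
house = (36.1627, -86.7816)
schools = [(36.1447, -86.8027), (36.2000, -86.7000)]
nearest_school_miles = distance_to_nearest(house, schools)
```

Because the value is computed per property, two houses in the same zone can receive different distances, which is the point of avoiding zonal aggregation.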
Several limitations made this exercise difficult to design. Our limited domain knowledge of the area made it challenging to determine which factors beyond the physical properties of a house most strongly influenced home prices. With a narrow understanding of how certain predictors are valued in relation to house prices, it was difficult to engineer the variables so that their distributions approximated reality. Another difficulty in predicting house prices was the tradeoff between accuracy and generalizability - while fitting our preliminary models, there were numerous instances of models predicting observed prices accurately in the training set, but not in the test set. Finally, the greatest challenge was to reduce the complexity of the housing market to a simple OLS Regression Model. Our understanding was that the relationships underlying house prices are rarely simple and linear - translating such relationships within the bounds of OLS assumptions proved to be a tricky process.
In brief, the final model showed that the most significant predictors of house prices were a combination of physical property and geodemographic characteristics. The final model predicted accurately on observed values but inaccurately on unobserved ones. In conclusion, although the model had significant predictors of house prices, it was not generalizable.
The initial dataset provided was limited to home owner information and physical house characteristics. With the understanding that house prices also depend on external spatial factors, we sought to expand the range of predictor variables to adequately represent the effect of space on house prices. To gather additional data, we explored open data platforms such as Nashville Open Data and the US Census Bureau. From these, additional datasets containing information on neighborhood characteristics, household composition, school locations, and police incident records were scraped and processed.
The table below presents the complete list of 36 predictors considered for our final model. Ultimately, only 10 predictors were fitted, as highlighted in purple (for property characteristics) and in red (for spatial characteristics).
| Predictors |
|---|
| Property Predictors |
| Location Zip |
| Location City |
| Building type |
| Story height |
| Exterior wall material |
| Frame type |
| Central Air System |
| Heating type |
| Foundation type |
| Year built |
| Effective year |
| Physical Depreciation |
| Acreage |
| Number of rooms |
| Number of bedrooms |
| Number of other rooms |
| Number of bathrooms |
| Number of half baths |
| Total number of bathrooms |
| Spatial Predictors |
| Local Owner |
| Distance to public school |
| Distance to good school |
| Distance to Vanderbilt University |
| Distance to Business Improvement District (BID) |
| Distance to airport |
| Police incidents |
| Commercial development within mile-vicinity |
| Residential development within mile-vicinity |
| Distance to bus stops |
| North or South of river |
| Road zones |
| Median income |
| Percentage Black Community |
| Number of Establishments within mile-vicinity |
| Pay per Employee |
| Employment |
Upon exploring the given dataset, it was concluded that several of the column characteristics could be aggregated into simpler predictor columns. For example, the Local Owner predictor was derived through a series of data manipulations on the raw dataset, in which the columns showing Owner City and Location City were compared.
A similar process of adding and subtracting values was used to further simplify internal house characteristics such as Number of Bedrooms, Number of Bathrooms, and Number of Other Rooms.
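The Local Owner comparison described above could look something like the following sketch. The project was done in R; this Python version is illustrative only, and the field names `owner_city` and `location_city`, the case-insensitive match, and the choice to leave the flag unknown when either city is missing are all assumptions rather than details from the original processing.

```python
def is_local_owner(owner_city, location_city):
    """True if the owner's city matches the property's city, None if unknown."""
    if owner_city is None or location_city is None:
        return None  # assumption: do not guess when either city is missing
    return owner_city.strip().upper() == location_city.strip().upper()


# Hypothetical records, for illustration only.
records = [
    {"owner_city": "Nashville", "location_city": "NASHVILLE"},
    {"owner_city": "Atlanta", "location_city": "NASHVILLE"},
    {"owner_city": None, "location_city": "NASHVILLE"},
]
for record in records:
    record["local_owner"] = is_local_owner(
        record["owner_city"], record["location_city"]
    )
```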
Despite the limitations of our domain knowledge about Nashville, through research we included more predictor variables that were thought to be influential factors in determining house prices. The following summarizes the process of how geodemographic variables in the final model were selected and engineered to appropriately fit the model.
Median Income: Each property point is associated with the median income of the block group it resides within. Median income is often thought to be a significant indicator in determining house prices, as the house price to income ratios reflect a measure of affordability. As such, higher income households are more likely to be able to afford higher priced homes while lower income households are less likely to buy more expensive homes.
Percent of Black Communities: Each property point is associated with the percentage of Black residents in the block group it resides within. Gentrification is a well-known phenomenon in city neighborhoods and certainly an issue to be considered. With the understanding that Nashville’s population is predominantly white, we sought to represent minority groups and explore whether their presence may be an influential factor in driving home sale prices. Whereas 78% of the total population is white, only 15% is Black or African American, with the rest of the population composed of other minority groups such as Asian and Hispanic.² Due to limited data on the smaller minority populations, the percentage of Black residents by block group was used for the analysis.
Pay Per Employee: This predictor was created by dividing the total payroll of each census block group by the total number of employees in the same block group. It characterizes employment types by distinguishing between high-paying and low-paying jobs, and assumes that white collar jobs yield a higher pay per employee than blue collar jobs do. By understanding how these values were distributed across the region, we expected higher pay per employee to be correlated with higher house prices, as higher-paid workers would be willing to pay more to live close to their workplaces, whereas those who are paid less would not value that proximity as much.
Zones by Road: By overlaying the Sale Price maps onto a satellite map of the region, it was possible to see another spatial pattern based on the major roads going through Davidson County. Noting that this could be another spatial predictor influencing home prices, we classified the county into road zones based on visual observation. Using ESRI’s ArcMap, Sale Price points were overlaid on the county and roads layers. The zones were manually drawn using the roads as boundaries.
Residential Development within mile-vicinity: The number of residential permit applications approved within a mile of the property is calculated and used as a proxy for the extent of residential development. Increasing the supply of housing in an area possibly indicates increased popularity for that area. Thus, increased competition to own a house in that area ultimately drives house values up. Residential developments were therefore considered to be a significant factor influencing home prices.
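A count of permits within a mile of a property can be sketched as below. This is an illustrative Python version (the project used R), and it assumes the coordinates have already been projected to a planar system measured in feet; the function name and all coordinates are hypothetical.

```python
import math

FEET_PER_MILE = 5280.0


def count_within_radius(point, events, radius_feet=FEET_PER_MILE):
    """Count events whose straight-line distance from point is within the radius."""
    return sum(1 for event in events if math.dist(point, event) <= radius_feet)


# Hypothetical projected (x, y) coordinates in feet, for illustration only.
house = (1000.0, 1000.0)
permits = [
    (1500.0, 1200.0),  # about 539 ft away
    (6000.0, 1000.0),  # 5000 ft away
    (1000.0, 6000.0),  # 5000 ft away
    (9000.0, 9000.0),  # more than two miles away
]
permits_within_mile = count_within_radius(house, permits)
```

For a full dataset, a spatial index would avoid comparing every property with every permit, but the brute-force loop keeps the idea visible.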
For more explanation on how the other predictor variables (not used in the final model) were selected, please refer to the appendix below.
As with most datasets, ours included null values, as well as 0 values that stood in for NAs. Due to the large number of observations with missing information, we attempted to predict these values through multiple hot-deck imputation, under the assumption that the null values were missing at random. This method imputes the target value from observations that are similar in terms of other variables - the assumption here is that observations similar in terms of other variables will likely have a similar value for the missing one.
The advised threshold for imputing missing values is 5% for large datasets³. While most of our variables with missing observations fell slightly above this threshold (around 6-7%), only one variable (Acreage) fell far beyond it at 39%. It was hoped that the imputations for the other predictors would yield better results. We created two otherwise identical datasets - Dataset X without imputed values and Dataset Y with imputed values. Both were tested when designing the final prediction model.
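To make the donor idea concrete, here is a minimal single-pass hot-deck sketch in Python. It is not the R routine used in the project and it does not reproduce the multiple-imputation step; the function name, the nearest-neighbour donor pool, and the toy records are all assumptions for illustration.

```python
import random


def hot_deck_impute(rows, target, match_on, n_donors=3, seed=0):
    """Fill missing `target` values by copying from a similar complete row.

    For each row where `target` is None, the `n_donors` complete rows closest
    on the `match_on` variables (squared Euclidean distance) form a donor
    pool, and one donor's value is drawn at random. Returns new rows.
    """
    rng = random.Random(seed)
    donors = [r for r in rows if r[target] is not None]
    imputed = []
    for row in rows:
        new_row = dict(row)  # leave the input rows untouched
        if row[target] is None:
            ranked = sorted(
                donors,
                key=lambda d: sum((d[k] - row[k]) ** 2 for k in match_on),
            )
            new_row[target] = rng.choice(ranked[:n_donors])[target]
        imputed.append(new_row)
    return imputed


# Hypothetical records, for illustration only.
homes = [
    {"bedrooms": 3, "baths": 2, "acres": 0.20},
    {"bedrooms": 3, "baths": 2, "acres": 0.25},
    {"bedrooms": 5, "baths": 4, "acres": 1.10},
    {"bedrooms": 3, "baths": 2, "acres": None},
]
filled = hot_deck_impute(homes, "acres", ["bedrooms", "baths"], n_donors=2)
```

In practice the matching variables would usually be standardised first so that no single variable dominates the distance, and the draw would be repeated to produce several imputed datasets.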
However, it was noted that missingness in this dataset was not at random. Instead, observations with systemic similarities yielded similar extent of missingness. This means that imputation methods might not improve fit, but worsen bias instead. Ultimately, for these reasons, we decided not to present predictions based off the imputed dataset.
To select variables from our initial set of 36, we began by running a Kitchen Sink regression model in which all 36 predictors were incorporated. Using this as a base model, the regression summary allowed us to quickly eliminate insignificant factors and work with those that were considered to be significant predictors of house prices. This process was performed by looking at the p-value. Predictors with a p-value of more than 0.05 were considered insignificant predictors of house sale prices - this means we cannot rule out that the estimated relationship arose by chance, so it may not hold for other observations of house sales.
In testing different combinations of predictors, models were compared on their ability to explain the observed variation in house sale prices using the R^2 value and the RMSE (Root Mean Square Error). Firstly, the R^2 value is a ‘goodness of fit’ measure for the linear regression model - it represents the proportion of variation in price explained by the predictors. Secondly, the RMSE indicates how far the model’s predicted house sale prices fall from the actual observed prices. A large RMSE indicates that the values predicted by the model deviate substantially from reality. Our choice of a finalised model for further cross-validation tests was therefore guided by the ideal of a high R^2 value and a low RMSE.
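The two diagnostics can be computed from observed and predicted prices as in the sketch below. This is an illustrative Python version (the project used R), and the prices are made up for the example.

```python
import math


def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(observed, predicted):
    """Share of variance in the observed values explained by the predictions."""
    mean_observed = sum(observed) / len(observed)
    ss_residual = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_total = sum((o - mean_observed) ** 2 for o in observed)
    return 1 - ss_residual / ss_total


# Hypothetical sale prices, for illustration only.
observed = [200000, 300000, 400000, 500000]
predicted = [210000, 290000, 420000, 480000]

model_rmse = rmse(observed, predicted)     # about 15811
model_r2 = r_squared(observed, predicted)  # about 0.98
```

The RMSE is in the same units as the sale price, which makes it easier to interpret than R^2 when judging whether prediction errors are acceptable.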
Our final model sought to predict House Sale Price as a function of:
1. Size of Property (in acres)
2. Number of Bedrooms
3. Number of Bathrooms
4. Number of Other rooms
5. Size of Black Community
6. Median Income of Area
7. Whether it is locally-owned
8. Annual Pay Per Employee in Area
9. Road Zone
10. Residential Development in Area

The following figures present the summary and exploratory statistics of the final fitted variables.
The table below presents the central tendency of variables fitted in the model - in other words, it reflects the average house characteristic and spatial situation in Nashville.
It can be observed that while the average house price in Nashville, at $290258, is not exceedingly high, the large standard deviation indicates a large disparity in house prices. There is a wide spectrum of house prices in this city, and the average house price cannot be used to generalise that Nashville has largely affordable housing.
The average property in Nashville is 0.23 acres in size - again, the large standard deviation indicates that this statistic is not generalisable across Nashville. This average property is likely to have 3 bedrooms, 2 or 3 bathrooms, and 3 other rooms that are neither bedrooms nor bathrooms.
The average property in Nashville tends to reside within a block group that is 27.4% Black, with a median income of $55973, in the south east part of the city. It is very likely owned by someone residing in Nashville rather than in another city in the United States. It is also typically surrounded by a high number of potential residential developments, with 159 issued and approved permits for future residential projects. It typically resides in an area with a relatively high economic pay-off, with employees working in its area earning around $443100 per year.
| Variable | Central Tendency | Standard Deviation |
|---|---|---|
| Dependent Variable | ||
| Sale Price | $290258 | 333876.2 |
| Predictive Property Characteristics | ||
| Size of Property | 0.23 Acres | 0.72 |
| Number of Bedrooms | Three Bedrooms (48.5%) | |
| Number of Bathrooms | 2 or 3 Bathrooms (63.1%) | |
| Number of Other Rooms | 3 Other Rooms (34.6%) | |
| Predictive Geodemographic Characteristics | ||
| Size of Black Community in Block Group | 27.4% | 25.7 |
| Median Income of Block Group | $55973 | 26957.9 |
| Annual Pay Per Employee in Area | $4431000 | 1070325 |
| Local Owner | Yes (84.1%) | |
| Road Zone | South East (22.9%) | |
| Number of Residential Development in Mile-vicinity | 156.9 | 149.1 |
The interactive map below presents the distribution of house sale prices across Nashville. To better visualise the relative differences in house sale prices between properties, the logged Sale Price is also presented as a comparison layer - you can toggle between the two layers to observe this distribution for yourself!
From this interactive map, we can observe that similar house prices are often clustered spatially together. Three big spatial clusters of high house sale prices can be immediately observed north and south of the river, as well as at the southern boundary of the city.
The spatial distributions of three predictors - Residential Development in area, Number of Bedrooms, Median Income of area - fitted in the model are presented likewise in interactive maps below. To aid your own visual exploration, we added the Sales Price layer for you to toggle between each predictor of interest and sales price!
From these maps, it can be observed that the Predictive Property Characteristic (Number of Bedrooms) seems to be distributed randomly across Nashville. On the other hand, the two Predictive Geodemographic Characteristics display clear spatial variation. This suggests that such variables play an important role in driving the spatially clustered patterns of house sale prices we observed in the previous map.